The dataset comes from Kaggle, collected by Austin Reese, and contains more than 500,000 used-car listings in the US scraped from Craigslist.
The goal of this project is to explore the dataset, discuss some interesting observations through visualizations, and train supervised machine learning models to fit and predict the prices of the used cars.
It is also a great dataset for practicing data cleaning and transformation skills, as it poses many natural problems to solve before analysis, such as outliers, null values, categorical features and imbalanced features. As part of the project, we need to clean the data to make it easier to work with.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
import pickle
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from tensorflow.keras import Sequential, callbacks
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.layers import Dense, Dropout, Activation
pd.set_option('display.max_rows',30)
%matplotlib inline
df = pd.read_csv('vehicles.csv')
df.describe()
From this basic statistical information about the dataset, we can see that:
We will discuss the remaining categorical features in the next section. Nevertheless, according to our findings above, this dataset must be cleaned before we apply any analysis to it.
df.info()
There are also a lot of missing values among features.
df.head()
Many columns have nothing to do with the prices of the cars, such as "id", "url" and "image_url". We will remove such columns in the next section.
First we need to remove the outliers in the prices. From the output of df.describe(), the 25% to 75% quantile range of the prices in our dataset is 4400 ~ 17926, but the maximum price is 4.3*10^9, which is far too high to be a real price!
fig, ax = plt.subplots(figsize=(10,2))
ax.set_title('Box plot of the prices')
sns.boxplot(x='price', data = df)
There are indeed some extremely large prices in our box plot. They are so large that we cannot actually see the "box". We have to remove them. A common way to remove outliers is using the interquartile range, shown below.
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
filter = (df['price'] >= Q1 - 1.5 * IQR) & (df['price'] <= Q3 + 1.5 *IQR)
init_size = df.count()['id']
df = df.loc[filter]
filtered_size = df.count()['id']
print(init_size-filtered_size,'(', '{:.2f}'.format(100*(init_size-filtered_size)/init_size), '%',')', 'outliers removed from dataset')
With 3% data loss, we now get a much better distribution of prices.
But when we draw the distribution plot of the prices:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title('Distribution of the prices')
sns.distplot(df['price'], bins=30, kde=False)
We can see that there is a large number of weird "free cars" in our dataset. We have to remove them as well.
Here we set a threshold of $600.
df = df[df['price']>600]
Now let's see whether there are outliers to remove in the odometer column.
fig, axs = plt.subplots(2, figsize=(20,10))
sns.distplot(df['odometer'], ax = axs[0])
axs[0].set_title('Distribution of the odometer')
axs[1].set_title('Box plot of the odometer')
sns.boxplot(x='odometer', data = df, ax=axs[1])
The same problem occurs in the odometer column. The outliers could come from two situations:
However, both situations add uncertainty to the predictions of our future model, so we remove these outliers using the same method as above. The only difference is our filter:
Q1 = df['odometer'].quantile(0.25)
Q3 = df['odometer'].quantile(0.75)
IQR = Q3 - Q1
filter = (df['odometer'] <= Q3 + 3 *IQR)
init_size = df.count()['id']
df = df.loc[filter]
filtered_size = df.count()['id']
print(init_size-filtered_size,'(', '{:.2f}'.format(100*(init_size-filtered_size)/init_size), '%',')', 'outliers removed from dataset')
Although we give up 18% of our data, the odometer column should now look normal.
Next we need to drop some uncorrelated columns: columns that we are pretty sure have no correlation with the car prices. These include id, url, region, region_url, title_status, VIN, image_url, description and county (which is an empty column). As for the state and coordinates, we need to make sure they don't have a high correlation with the price first.
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Geographical distribution of the cars colored by prices')
sns.scatterplot(x= 'long', y='lat', data = df, hue = 'price', ax=ax )
We can see that most of the used cars are being sold in the US, with some others in other parts of the world. Although the high-price cars are not distributed uniformly, the prices don't seem to vary much geographically, indicating that there is no significant correlation between prices and coordinates.
fig, ax = plt.subplots(figsize=(15,10))
ax.set_title('Mean car price of each state')
df.groupby(['state']).mean()['price'].plot.bar(ax=ax)
The prices don't differ very much across US states either. But we do find that used-car prices are slightly higher in the states bordering Canada (such as Alaska, Idaho, Washington, Montana and North Dakota). See the figure above, where blue indicates higher average prices and yellow indicates lower average prices. The second figure was drawn in Power BI.
As locations do not have a major correlation with the prices, we choose to remove 'long', 'lat' and 'state' as well.
df = df.drop(columns = ['id', 'url', 'region', 'region_url', 'title_status', 'vin', 'image_url', 'description', 'county', 'state', 'long', 'lat'])
df.head()
Now all remaining feature columns are ones we believe to be correlated with the prices.
Now that we have the column 'model', there is no need for the column 'manufacturer' in modeling (as normally one car model belongs to only one manufacturer). But as it is still needed for further visualization, we store this column in a new data frame (df_man) before removing it.
df_man = df['manufacturer'].to_frame()
df = df.drop(columns = ['manufacturer'])
Now let's check the distribution of the null values among all columns.
fig, ax = plt.subplots(figsize=(8,6))
ax.set_title('Distribution of the missing values (yellow records)')
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Given that there are plenty of null values in our dataset, and that these missing values are hard to fill with proper guesses, we decide to take the following three actions:
df = df.drop(columns = ['size'])
rm_rows = ['year', 'model', 'fuel', 'transmission', 'drive', 'type', 'paint_color']
for column in rm_rows:
    df = df[~df[column].isnull()]
df = df.replace(np.nan, 'null', regex=True)
df.info()
fig, ax = plt.subplots(figsize=(8,6))
ax.set_title('Distribution of the missing values (yellow records)')
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Now there are no missing values in our dataset.
Let's first see the relationship between prices and mileages (odometer).
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Scatter plot between mileages and prices')
sns.scatterplot(x='odometer', y='price', data=df)
Cars with higher odometer readings tend to have lower prices, while cars with lower mileage tend to be more expensive. A funny observation is that people tend to set prices that are easy to remember, such as 5000, 5100 or 8999 (hence the many horizontal lines in the scatter plot above).
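The "easy to remember" observation can be checked directly by counting listings whose price is a round number or ends in 99. A sketch with hypothetical toy prices standing in for df['price']:

```python
import pandas as pd

# Hypothetical toy prices standing in for df['price']
prices = pd.Series([5000, 5100, 8999, 7342, 12000, 9999, 15500, 6871])

# Share of listings priced at a multiple of 100 or ending in 99
round_mask = (prices % 100 == 0) | (prices % 100 == 99)
print(f'{round_mask.mean():.0%} of these listings use a "memorable" price')
```

On the real data, the spikes visible in the scatter plot suggest this share is substantial.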
However, some relatively new cars were sold nearly for free, which goes against common sense. Thus we need to intervene.
df = df[(df['price']+df['odometer'])>5000]
Cars that are too old (say, earlier than 1960) add uncertainty to our predictions, because of their scarcity and probably unstable prices (some of them can be regarded as antiques). So we remove the samples older than 1960.
df = df[df['year']>1960]
df_man['manufacturer'].value_counts()
Some brands have too few samples.
We remove the manufacturers that hold fewer than 100 records.
rm_brands = ['harley-davidson', 'alfa-romeo', 'datsun', 'tesla', 'land rover', 'porche', 'aston-martin', 'ferrari']
for brand in rm_brands:
    df_man = df_man[~(df_man['manufacturer'] == brand)]
Now let's turn to the car models. For the precision of our future model, we choose to remove the car models that have fewer than 50 samples. This narrows the capability of our model, but in return lowers the bias and variance.
df = df.groupby('model').filter(lambda x: len(x) > 50)
df['model'].value_counts()
df.info()
fig, axs = plt.subplots(2, figsize=(14, 10))
axs[0].set_title('Box plot of the prices')
sns.boxplot(x='price', data = df, ax = axs[0])
axs[1].set_title('Distribution of the prices')
sns.distplot(df['price'], ax=axs[1], bins=30, kde=False)
We see that after the data cleaning, the distribution of the prices looks better. Most of the cars are sold at prices between
fig, axs = plt.subplots(2, figsize=(14, 10))
sns.boxplot(x='odometer', data = df, ax=axs[0])
axs[0].set_title('Box plot of the odometer')
sns.distplot(df['odometer'], ax = axs[1], bins=30, kde=False)
axs[1].set_title('Distribution of the odometer')
Now let's see how different factors influence the car prices.
df['paint_color'].value_counts()
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Box plot of the prices on each color')
sns.boxplot(x='paint_color', y='price', data = df)
Besides customized colors, there are 11 common colors in the dataset. White, black, orange and yellow cars are the top four colors ranked by median price. By contrast, green and purple are the least popular colors. Note that due to the relatively few samples for purple, yellow and orange, the statement above may not be entirely accurate.
print(df['type'].value_counts())
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Box plot of the prices on each car type')
sns.boxplot(x='type', y='price', data = df)
Pickups, trucks and buses have higher prices, as they cost more when new. The prices for sedans, wagons, hatchbacks and mini-vans are more stable.
print('Condition:')
print(df['condition'].value_counts())
print('\nCylinders:')
print(df['cylinders'].value_counts())
print('\nFuel:')
print(df['fuel'].value_counts())
print('\nTransmission:')
print(df['transmission'].value_counts())
print('\nDrive:')
print(df['drive'].value_counts())
fig=plt.figure(figsize=(25,37))
fig.add_subplot(3, 2, 1)
sns.boxplot(x='condition', y='price', data = df)
fig.add_subplot(3, 2, 2)
sns.boxplot(x='cylinders', y='price', data = df)
fig.add_subplot(3, 2, 3)
sns.boxplot(x='fuel', y='price', data = df)
fig.add_subplot(3, 2, 4)
sns.boxplot(x='transmission', y='price', data = df)
fig.add_subplot(3, 2, 5)
sns.boxplot(x='drive', y='price', data = df)
From these five figures we conclude the following reasonable phenomena:
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Scatter plot of the prices in each year, colored by transmission type')
#sns.scatterplot(x='year', y='price', data=df[(df['transmission'] =='manual') | (df['transmission'] =='other')], hue = 'transmission')
sns.scatterplot(x='year', y='price', data=df, hue = 'transmission')
Higher prices are more likely to be seen in newer cars; older cars tend to be cheaper (but not always).
Although there are more and more cars with automatic transmission, manual types are still very popular. Also, there appear to be more cars with "other" transmission since 2010; I guess this includes newer gearbox types such as CVT and DCT.
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Scatter plot of the prices in each year, colored by drive type')
sns.scatterplot(x='year', y='price', data=df, hue = 'drive')
We see that most of the old cars (older than 1990) are rear-wheel drive. Front-wheel drive and all-wheel drive started to become popular around 1990. Cars with AWD are obviously more expensive than cars with FWD, which is quite reasonable. There are still some expensive cars with RWD; I guess this is because many modern sports cars are rear-wheel drive.
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Scatter plot of the prices in each year, colored by fuel type')
sns.scatterplot(x='year', y='price', data=df, hue = 'fuel')
Diesel cars clearly tend to be more expensive than gas cars. One of the main reasons could be that diesel engines are usually more expensive and are mostly fitted to high-power vehicles such as trucks, buses and upmarket SUVs. A strange phenomenon is that we can hardly see diesel cars among the used cars older than 1990.
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Scatter plot of the prices in each year, colored by cylinder type')
sns.scatterplot(x='year', y='price', data=df, hue = 'cylinders')
We can see that most cars built before 1990 have 8 cylinders. In recent years, the upper end of the market has been occupied by 6- and 8-cylinder cars, while relatively cheap cars are mainly 4-cylinder.
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Scatter plot of the odometer of the cars in each year, colored by car condition')
sns.scatterplot(x='year', y='odometer', data=df, hue = 'condition')
Higher odometer readings are most likely to be seen in cars from around 2005. This is reasonable: newer cars have not yet accumulated high mileage, while older cars need lower mileage to be sold at a decent price in the used-car market. If a car has both high age and high mileage, the owner may sell the parts directly to a repair shop rather than selling it as a used car.
Also, we cannot see a straightforward correlation between "condition" and "odometer" (or "year"), which is a surprise. Maybe different people hold different views of "condition", as it is quite subjective.
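To put a number on this, one could compare median odometer readings per condition level. A sketch on hypothetical toy data (the real check would use df itself):

```python
import pandas as pd

# Toy stand-in for the relevant columns of df (hypothetical values)
toy = pd.DataFrame({
    'condition': ['excellent', 'good', 'fair', 'good', 'excellent', 'fair'],
    'odometer':  [120000, 60000, 150000, 90000, 30000, 40000],
})

# If "condition" tracked wear, the medians should rise steadily from
# excellent to fair; as in the real data, here they do not.
medians = toy.groupby('condition')['odometer'].median()
print(medians)
```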
fig, ax = plt.subplots(figsize=(35,15))
ax.set_title('Count plot of all cars group by manufacturer')
sns.countplot(x='manufacturer', data=df_man)
From the plot we see that the top five popular brands in the used-car market are Ford, Chevrolet, Toyota, Nissan and Jeep.
fig, ax = plt.subplots(figsize=(12,10))
ax.set_title('Pearson correlations among prices, years and mileages')
sns.heatmap(df.corr())
This is another way to see the high correlations among years, mileages and prices.
Before feeding those data to the machine learning models to predict prices, we still need to finish the following preparation tasks.
We transform the string values in all categorical features into numeric values using one-hot encoding. Note that the "drop_first" parameter of "get_dummies" removes the first column for each feature to avoid the collinearity caused by correlations among the newly created columns.
cate_Columns = ['model', 'condition', 'cylinders', 'fuel', 'transmission', 'drive', 'type', 'paint_color']
for column in cate_Columns:
    dummies = pd.get_dummies(df[column], drop_first=True)  # one-hot encode this feature
    df = pd.concat([df, dummies], axis=1)
df = df.drop(columns = cate_Columns)
df.head()
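One caveat with building the dummies feature by feature: if two categorical features share a level name (for example, several columns here contain an "other" category), the concatenated frame ends up with duplicate column names. A minimal illustration with hypothetical toy columns:

```python
import pandas as pd

# Two hypothetical features that both contain an 'other' level
toy = pd.DataFrame({'fuel': ['gas', 'other'], 'type': ['SUV', 'other']})
dummies = pd.concat(
    [pd.get_dummies(toy[c], drop_first=True) for c in ['fuel', 'type']],
    axis=1,
)
print(list(dummies.columns))               # ['other', 'other']
print(dummies.columns.duplicated().sum())  # 1 duplicated column name
```

This is presumably why the notebook later drops duplicated columns before fitting XGBoost.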
We also need to normalize the values in the numerical features ("year" and "odometer"), as they do not have the same scale as the other newly created columns.
std_scaler = StandardScaler()
for column in ['year', 'odometer']:
    df[column] = std_scaler.fit_transform(df[column].values.reshape(-1,1))
df.head()
We set 70% of the data to be the training set, leaving the remaining for testing.
X_train, X_test, y_train, y_test = train_test_split(df.drop('price',axis=1),
df['price'], test_size=0.30,
random_state=141)
In this section, we are going to create and train several machine learning models to see their performance in this used car dataset for price prediction.
As this is a regression problem, we use the R2 score and root mean squared error (RMSE) to evaluate our models.
model_score = pd.DataFrame(columns=('r2', 'rmse'))
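Each model below repeats the same evaluation steps. A hypothetical helper (not in the original notebook) could record both metrics in one call; it uses pd.concat, since DataFrame.append was removed in pandas 2.0:

```python
import math
import pandas as pd
from sklearn import metrics

model_score = pd.DataFrame(columns=('r2', 'rmse'))

def record_score(scores, name, y_true, y_pred):
    """Compute R2 and RMSE for one model and append them as a row."""
    r2 = metrics.r2_score(y_true, y_pred)
    rmse = math.sqrt(metrics.mean_squared_error(y_true, y_pred))
    row = pd.DataFrame({'r2': [r2], 'rmse': [rmse]}, index=[name])
    return pd.concat([scores, row])

# Toy usage with made-up predictions
model_score = record_score(model_score, 'toy', [1, 2, 3], [1.1, 1.9, 3.2])
print(model_score)
```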
In scikit-learn, LinearRegression() uses ordinary least squares to compute the coefficients of the linear regression model, without a "learning" process through gradient descent.
lrmodel = LinearRegression()
lrmodel.fit(X_train,y_train)
lr_predict = lrmodel.predict(X_test)
lr_r2 = metrics.r2_score(y_test, lr_predict)
lr_rmse = math.sqrt(metrics.mean_squared_error(y_test, lr_predict))
model_score = model_score.append(pd.DataFrame({'r2':[lr_r2], 'rmse':[lr_rmse]}, index = ['Linear Regression']))
print('For the linear regressor, the root mean square error for the testing set is:', lr_rmse)
print('The r2 score for the testing set is:', lr_r2)
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Comparison between predicted prices and actual prices in testing set, linear regression')
plt.scatter(y_test, lr_predict)
We can see that the linear regression model does not perform well on this dataset. To check whether it is overfitting, we calculate the score on the training set:
lr_predict_train = lrmodel.predict(X_train)
lr_r2_train = metrics.r2_score(y_train, lr_predict_train)
lr_rmse_train = math.sqrt(metrics.mean_squared_error(y_train, lr_predict_train))
print('For the linear regressor, the root mean square error for the training set is:', lr_rmse_train)
print('The r2 score for the training set is:', lr_r2_train)
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Comparison between predicted prices and actual prices in training set, linear regression')
plt.scatter(y_train, lr_predict_train)
With similar RMSE and R2 scores, it seems that overfitting is not the problem.
Lasso regression is a linear regression model with an L1 regularization term to reduce the errors caused by collinearity and overfitting. We pick 12 regularization coefficients and choose the best one via cross-validation.
Although we saw above that the poor performance of the linear model is not caused by overfitting, we still decide to give Lasso a try.
alphas = np.logspace(-4,4,12)
lasso = LassoCV(max_iter=10**6, alphas=alphas)
lasso.fit(X_train, y_train)
lasso_predict = lasso.predict(X_test)
lasso_r2 = metrics.r2_score(y_test, lasso_predict)
lasso_rmse = math.sqrt(metrics.mean_squared_error(y_test, lasso_predict))
model_score = model_score.append(pd.DataFrame({'r2':[lasso_r2], 'rmse':[lasso_rmse]}, index = ['Lasso Regression']))
print('For the Lasso linear regressor, the root mean square error for the testing set is:', lasso_rmse)
print('The r2 score for the testing set is:', lasso_r2)
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Comparison between predicted prices and actual prices in testing set, Lasso regression')
plt.scatter(y_test, lasso_predict)
As expected, the result is very close to the previous linear regression model.
With r2 = 0.801, this seems to be about the best a linear model can do, unless we do some deeper processing of the dataset itself.
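One kind of "deeper processing" worth trying (an assumption on my part, not tested in this notebook) is regressing on log(price): prices are right-skewed and depreciation is roughly multiplicative, so a linear model often fits the log target better. A toy sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: price decays multiplicatively with mileage (an assumed model)
rng = np.random.default_rng(0)
odometer = rng.uniform(0, 200_000, 500)
price = 20_000 * np.exp(-odometer / 80_000) * rng.lognormal(0, 0.1, 500)

X = odometer.reshape(-1, 1)
lin = LinearRegression().fit(X, price)              # raw target
log_lin = LinearRegression().fit(X, np.log(price))  # log target

print('R2 on raw prices:', lin.score(X, price))
print('R2 on log prices:', log_lin.score(X, np.log(price)))
```

On this toy data the log-target fit scores noticeably higher, because the log transform makes the price-mileage relationship linear.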
Here we train a fully connected neural network for regression using Keras.
To turn the model from a classifier into a regressor, we first change the output layer to a single unit, then change the loss function from cross-entropy to MSE.
callback = callbacks.EarlyStopping(monitor='loss', patience=3)
nn_model = Sequential()
nn_model.add(Dense(input_dim = X_train.shape[1], units = 2000, activation = 'relu'))
#nn_model.add(Dropout(0.3))
nn_model.add(Dense(units = 2000, activation = 'relu'))
nn_model.add(Dense(units=1))
nn_model.compile(loss='mean_squared_error', optimizer = 'adam', metrics=['mae', 'mse'])
nn_model.fit(X_train, y_train, batch_size=5000, epochs=800, callbacks=[callback], verbose=0)
nn_predict = nn_model.predict(X_test)
nn_rmse = math.sqrt(metrics.mean_squared_error(y_test, nn_predict))
nn_r2 = metrics.r2_score(y_test, nn_predict)
model_score = model_score.append(pd.DataFrame({'r2':[nn_r2], 'rmse':[nn_rmse]}, index = ['MLP']))
print('For the MLP model, the root mean square error for the testing set is:', nn_rmse)
print('The r2 score for the testing set is:', nn_r2)
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Comparison between predicted prices and actual prices in testing set, MLP')
plt.scatter(y_test, nn_predict)
With r2 = 0.915 and rmse = 2476 on the testing set, this MLP regressor is the best model so far.
KNN can sometimes achieve high accuracy, at the cost of time: as it is a "lazy learning" model, prediction is very slow even after training is done.
We use GridSearchCV() to find the best number of neighbors via cross-validation.
knnReg = KNeighborsRegressor()
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 7)]
    }]
grid_search_knn = GridSearchCV(knnReg, param_grid,n_jobs=-1,verbose=2)
grid_search_knn.fit(X_train, y_train)
knn_best = grid_search_knn.best_estimator_
knn_best
The best number of neighbors for our knn model is 4.
knn_predict = knn_best.predict(X_test)
knn_r2 = metrics.r2_score(y_test, knn_predict)
knn_rmse = math.sqrt(metrics.mean_squared_error(y_test, knn_predict))
model_score = model_score.append(pd.DataFrame({'r2':[knn_r2], 'rmse':[knn_rmse]}, index = ['K - Nearest Neighbor']))
print('For the K-NN regressor, the root mean square error for the testing set is:', knn_rmse)
print('The r2 score for the testing set is:', knn_r2)
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(y_test, knn_predict)
We see that the KNN model does achieve pretty high accuracy.
In the decision tree model, we did not limit the maximal depth or number of leaf nodes.
dt_model = DecisionTreeRegressor(random_state=0)
dt_model.fit(X_train, y_train)
dt_predict = dt_model.predict(X_test)
dt_r2 = metrics.r2_score(y_test, dt_predict)
dt_rmse = math.sqrt(metrics.mean_squared_error(y_test, dt_predict))
model_score = model_score.append(pd.DataFrame({'r2':[dt_r2], 'rmse':[dt_rmse]}, index = ['Decision Tree']))
print('For the decision tree regressor, the root mean square error for the testing set is:', dt_rmse)
print('The r2 score for the testing set is:', dt_r2)
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(y_test, dt_predict)
ranF_model = RandomForestRegressor(max_depth=8, random_state=0)
ranF_model.fit(X_train, y_train)
ranF_predict = ranF_model.predict(X_test)
ranF_r2 = metrics.r2_score(y_test, ranF_predict)
ranF_rmse = math.sqrt(metrics.mean_squared_error(y_test, ranF_predict))
model_score = model_score.append(pd.DataFrame({'r2':[ranF_r2], 'rmse':[ranF_rmse]}, index = ['Random Forest']))
print('For the random forest regressor, the root mean square error for the testing set is:', ranF_rmse)
print('The r2 score for the testing set is:', ranF_r2)
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(y_test, ranF_predict)
The random forest model does not perform well; it is even worse than the single decision tree. Maybe the trees are not deep enough (max_depth=8).
We tried two different kernels for the support vector machine: Gaussian (RBF) and linear.
svr_model = SVR(C = 1, epsilon = 0.2, kernel = 'rbf', max_iter=10000)
svr_model.fit(X_train, y_train)
svr_predict = svr_model.predict(X_test)
svr_r2 = metrics.r2_score(y_test, svr_predict)
svr_rmse = math.sqrt(metrics.mean_squared_error(y_test, svr_predict))
model_score = model_score.append(pd.DataFrame({'r2':[svr_r2], 'rmse':[svr_rmse]}, index = ['SVM_gaus']))
print('For the support vector regressor with gaussian kernel, the root mean square error for the testing set is:', svr_rmse)
print('The r2 score for the testing set is:', svr_r2)
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(y_test, svr_predict)
svr_model2 = SVR(C = 1, epsilon = 0.2, kernel = 'linear', max_iter=10000)
svr_model2.fit(X_train, y_train)
svr_predict2 = svr_model2.predict(X_test)
svr2_r2 = metrics.r2_score(y_test, svr_predict2)
svr2_rmse = math.sqrt(metrics.mean_squared_error(y_test, svr_predict2))
model_score = model_score.append(pd.DataFrame({'r2':[svr2_r2], 'rmse':[svr2_rmse]}, index = ['SVM_linear']))
print('For the support vector regressor with linear kernel, the root mean square error for the testing set is:', svr2_rmse)
print('The r2 score for the testing set is:', svr2_r2)
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(y_test, svr_predict2)
In both cases, the models were forced to stop by our maximal-iteration setting; otherwise they would run for a very long time. As a result, they both perform badly.
gb_model = GradientBoostingRegressor(random_state=0)
gb_model.fit(X_train, y_train)
gb_predict = gb_model.predict(X_test)
gb_r2 = metrics.r2_score(y_test, gb_predict)
gb_rmse = math.sqrt(metrics.mean_squared_error(y_test, gb_predict))
model_score = model_score.append(pd.DataFrame({'r2':[gb_r2], 'rmse':[gb_rmse]}, index = ['GBDT']))
print('For the gradient boosting regressor, the root mean square error for the testing set is:', gb_rmse)
print('The r2 score for the testing set is:', gb_r2)
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(y_test, gb_predict)
xgb_model = XGBRegressor()
# XGBoost requires unique feature names, so drop the duplicated dummy columns
df_noDuplicate = df.loc[:,~df.columns.duplicated()]
X_train_nd, X_test_nd, y_train_nd, y_test_nd = train_test_split(df_noDuplicate.drop('price',axis=1),
df['price'], test_size=0.30,
random_state=141)
xgb_model.fit(X_train_nd, y_train_nd)
xgb_predict = xgb_model.predict(X_test_nd)
xgb_r2 = metrics.r2_score(y_test_nd, xgb_predict)
xgb_rmse = math.sqrt(metrics.mean_squared_error(y_test_nd, xgb_predict))
model_score = model_score.append(pd.DataFrame({'r2':[xgb_r2], 'rmse':[xgb_rmse]}, index = ['XGBoost']))
print('For the XGBoost regressor, the root mean square error for the testing set is:', xgb_rmse)
print('The r2 score for the testing set is:', xgb_r2)
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(y_test_nd, xgb_predict)
XGBoost performs pretty well, just behind the neural network.
model_score.sort_values(by=['r2'], ascending=False)
From the dataframe above, we conclude that the neural network performs the best on this dataset. But even so, the RMSE is still around 2600, which is pretty large compared with the actual prices.
There are many reasons for this. For example, we did not tune the parameters of the models above very much, so they could do better with proper parameters. Also, the dataset may not be clean enough: some values in the categorical features are insufficient (e.g. there are only six 12-cylinder cars and 165 electric cars in total), and some columns are still correlated. A simple example is that most Subaru models (in "model") are also AWD (in "drive"). But most importantly, according to the scatter plots between predicted and real prices, there are always some points with very large differences between the two. I think the future work to improve these models is to output these outliers and analyze the causes of this large bias.
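That outlier analysis could start with something like the following, a hypothetical sketch with toy stand-ins for y_test and a model's predictions:

```python
import pandas as pd

# Toy stand-ins for y_test and a model's predictions
actual = pd.Series([5000, 12000, 8000, 30000, 9500])
predicted = pd.Series([5400, 11000, 8200, 9000, 9900])

# Rank test samples by absolute error and inspect the worst offenders
residuals = pd.DataFrame({'actual': actual, 'predicted': predicted})
residuals['abs_error'] = (residuals['actual'] - residuals['predicted']).abs()
worst = residuals.sort_values('abs_error', ascending=False).head(3)
print(worst)
```

Joining the worst rows back to the original feature columns would then show which kinds of listings the models misprice most.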